JPEG Quantization optimized for Pentium® II Processor

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.

Third-party brands and names are the property of their respective owners.

1.0. Introduction

This application note describes optimization techniques that substantially improve the performance of the Quantization step in JPEG compression on the Pentium II Processor and the Pentium Processor with MMX(TM) Technology. JPEG Quantization takes 64 DCT frequency components, divides each by a "quantizer step size", and rounds the results to integers to form the quantized coefficients. Both a C implementation and an MMX(TM) Technology assembly implementation are presented, and performance results for both are summarized. The code provided in this application note can be plugged directly into the IJG (Independent JPEG Group) royalty-free software with minimal code modifications to take advantage of MMX(TM) Technology. The modifications are listed in the Code Listing (Section 6.0).

2.0. JPEG

JPEG uses the properties of the human eye to achieve 10 to 80 times compression. The JPEG baseline model consists of four stages: a transformation stage, a lossy quantization stage, and two lossless coding stages. The initial transform concentrates the information energy into the first few transform coefficients, the quantizer causes a controlled loss of information, and the two coding stages further compress the data. The YCbCr color space separates luminance (Y) from chrominance (Cb, Cr). The compression algorithm takes advantage of the fact that the human eye is more sensitive to luminance than to chrominance. The sensitivity of the eye also increases at low intensity levels. The individual color components in the YCbCr color space are less correlated than those in the RGB color space, so this model can be applied to compress each YCbCr component individually.

2.1. Quantization Algorithm

The Quantization step reduces the magnitude of the DCT coefficients and increases the number of zero-valued coefficients, based on the eye's ability to detect different levels at a given frequency. The quantization values are chosen to match the sensitivity of the eye: small values for low-frequency coefficients and larger values for high-frequency coefficients. The JPEG baseline model is considered a "lossy" compressor because the reconstructed image is not identical to the original. Lossless coders, which create images identical to the original, achieve lower compression ratios than JPEG.

In preparation for the transformation step, each color component of the image is broken up into 8X8 pixel blocks. The video energy of an 8X8 block is scattered throughout its elements. When this energy varies slowly across the image, the two-dimensional DCT concentrates it into a few coefficients. The uniform midstep quantizer is used for the JPEG baseline method, where the step size varies according to the coefficient location and the color component being encoded. Two separate Quantization tables are used, one for luminance and one for chrominance. The equation for the quantizer can be written as:

Quantized Coefficient = DCT Frequency Coefficient / Quantizer

The decompression step uses the inverse quantizer:

DCT Frequency Coefficient = Quantized Coefficient * Quantizer
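The two equations above can be sketched in scalar C as follows. This is a minimal illustration, not the IJG code: the names quantize_block, dequantize_block, and BLOCK_SIZE are assumptions made for the example, and rounding to the nearest integer is assumed for the quantizer.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 64  /* one 8X8 block of coefficients (illustrative name) */

/* Quantize: divide each DCT coefficient by its step size and round to the
 * nearest integer (ties round away from zero). */
static void quantize_block(const int16_t *dct, const uint16_t *step, int16_t *out)
{
    for (int i = 0; i < BLOCK_SIZE; i++) {
        int q = step[i];
        int c = dct[i];
        assert(q > 0);
        out[i] = (int16_t)(c >= 0 ? (c + q / 2) / q : -((-c + q / 2) / q));
    }
}

/* Inverse quantizer used by decompression: multiply back by the step size.
 * The information lost to rounding is not recovered. */
static void dequantize_block(const int16_t *quant, const uint16_t *step, int16_t *out)
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        out[i] = (int16_t)(quant[i] * step[i]);
}
```

With a step size of 16, for example, a coefficient of 100 quantizes to 6 and is reconstructed as 96; the difference of 4 is the controlled loss introduced by this stage.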

Quantization is the lossy stage in the JPEG coding scheme. If quantization is too coarse, images look "blocky", but if it is too fine, bits are wasted coding what is essentially noise. Quantization can be controlled by the Quality Factor, a number that changes the default quantization matrix by an effective multiplicative factor. A lower quality image gives better compression and vice versa.

2.2. Quality Factor

Most JPEG compressors let you pick a file size versus image quality tradeoff by selecting a quality setting. For good quality full color source images, the default IJG quality setting (Q75) is very often the best choice. If the image was not high quality to begin with, dropping down to Q50 will not cause much degradation. Q95 is about the highest recommended quality. Q100 will generate a file 2 to 3 times larger than Q95 without much improvement. Images with sharp color edges may need a higher quality setting to avoid jagged edges.
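How a quality setting turns into a multiplicative factor on the quantization table can be sketched as follows. The sketch follows the scaling rule of the IJG library (jpeg_quality_scaling in jcparam.c) as best understood here; the function names quality_to_scale and scale_step are illustrative and are not IJG's.

```c
#include <assert.h>

/* Map a quality setting (1..100) to a percentage scale factor.
 * Q50 leaves the base table unchanged (100%); lower quality scales it up. */
static int quality_to_scale(int quality)
{
    if (quality <= 0)  quality = 1;
    if (quality > 100) quality = 100;
    return (quality < 50) ? 5000 / quality : 200 - quality * 2;
}

/* Scale one base step size by the percentage, rounding to nearest, and
 * clamp to 1..255 (the range allowed for baseline JPEG tables). */
static int scale_step(int base_step, int scale_percent)
{
    long v;
    assert(base_step > 0 && scale_percent >= 0);
    v = ((long)base_step * scale_percent + 50L) / 100L;
    if (v < 1)   v = 1;
    if (v > 255) v = 255;
    return (int)v;
}
```

Under this rule Q75 halves every step size (scale factor 50), while Q100 gives a scale factor of 0, so every step size is clamped to 1 and almost nothing is discarded. That is consistent with the much larger files noted above for Q100.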

3.0. Optimization with MMX(TM) Technology

The first step in optimizing was profiling with VTune. Profiling identified the functions on which to concentrate. Optimizing the most heavily used functions yields the largest overall performance gain.

3.1. Eliminating Data-dependent Branches

The original code takes the equation for the Quantizer and rounds to integers by adding half the denominator to the numerator and then performing an integer divide:

Quantized Coefficient = (DCT Frequency Coefficient + (Quantizer/2)) / Quantizer

There were two data-dependent branches in the original C code. The first detects whether the denominator is greater than the numerator and, if so, sets the result to zero. This "early out" mechanism avoids the slow divide. The second detects whether the DCT Frequency Coefficient is positive or negative and rounds accordingly: a negative rounder is added to a negative coefficient and a positive rounder to a positive coefficient. Although the Pentium II processor has a sophisticated branch prediction algorithm, random sign changes in the DCT coefficients make these branches unpredictable, and each misprediction costs 9 to 26 clock cycles on the Pentium II Processor. The first branch was eliminated altogether by precalculating the ratios and the rounders and storing them in tables. The second was eliminated by using a simpler rounding method, in which the rounder is the same whether the data is positive or negative, without introducing any detectable error.
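The idea of removing the sign-dependent branch can be illustrated in scalar C. The sketch below is an assumption-laden model rather than the code from the listing: it computes both the positive-path and negative-path results and then selects one with a mask, which is the same pattern the MMX quantization loop in Section 6.0 expresses with pcmpgtw, pand, pandn, and por. The function names are illustrative, and the divide is kept here so that only the branch handling differs.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version with a data-dependent branch on the sign. */
static int16_t quantize_branchy(int16_t coef, int16_t q)
{
    int t = coef;
    if (t < 0)
        return (int16_t)-((-t + (q >> 1)) / q);
    return (int16_t)((t + (q >> 1)) / q);
}

/* Mask-select version: compute both candidates, then pick one with a mask
 * instead of branching on the sign of the coefficient. */
static int16_t quantize_select(int16_t coef, int16_t q)
{
    assert(q > 0);
    int16_t mask = (int16_t)-(coef < 0);               /* all ones if negative, else 0 */
    int16_t pos  = (int16_t)((coef + (q >> 1)) / q);   /* valid when coef >= 0 */
    int16_t neg  = (int16_t)-((-coef + (q >> 1)) / q); /* valid when coef < 0 */
    return (int16_t)((neg & mask) | (pos & ~mask));
}
```

Whether a C compiler emits a branch for either function is up to the compiler. The point of the sketch is the data flow: the selection depends on a mask rather than on control flow, so there is nothing for the branch predictor to mispredict.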

3.2. Eliminating Divides

Division is much slower than multiplication on the Pentium Processor and the Pentium II Processor. Dividing by a constant is inefficient, and the reciprocal should be precalculated instead. The C code that generates the quantized coefficients performs 64 divides by constants for every 8X8 pixel block in the picture. This is costly because divides are time consuming, data dependent, and not pipelined, so it is an area where JPEG Quantization performance can be increased. The divisor (Q) tables for luminance and chrominance are already set up. Two new tables containing (2^16)/Q values are precalculated. Each value is multiplied by the corresponding (F + Q/2), where F is the DCT Frequency Coefficient, and the product is shifted right 16 bits to compensate for the 2^16 scaling. Here is the complete picture:

Quantized Coefficient = (DCT Frequency Coefficient + (Quantizer/2)) / Quantizer
= [(DCT Frequency Coefficient + (Quantizer/2)) * (2^16/Quantizer)] >> 16

Creating the two new tables incurs a one-time cost of 64 divides, which is negligible compared to 64 divides per 8X8 pixel block in a picture. The multiplication by "(2^16/Quantizer)" and the shift right using ">>16" are both accomplished by the MMX(TM) Technology instruction PMULHW.
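The divide-to-multiply conversion can be modeled in scalar C as shown below. This is a sketch under stated assumptions: mulhi16 is an illustrative one-lane model of PMULHW (the high 16 bits of a signed 16x16 product), the helper names are not from the IJG code, and only non-negative coefficients are handled. The multiplier is rounded in the same way as in the table setup of the listing in Section 6.0.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one PMULHW lane: high 16 bits of a signed 16x16 product.
 * Only non-negative operands are used here, so the right shift is well defined. */
static int16_t mulhi16(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* One-time table setup for divisor q: rounder = q/2, multiplier ~ 2^16/q. */
static void make_tables(int16_t q, int16_t *rounder, int16_t *multiplier)
{
    assert(q >= 3);  /* smaller divisors would overflow a signed 16-bit multiplier */
    *rounder    = (int16_t)(q >> 1);
    *multiplier = (int16_t)((65536 + (q >> 1)) / q);
}

/* Per-coefficient work: one add and one multiply-high, no divide.
 * Assumes coef >= 0 and coef + rounder <= 32767. */
static int16_t quantize_nodiv(int16_t coef, int16_t rounder, int16_t multiplier)
{
    return mulhi16((int16_t)(coef + rounder), multiplier);
}
```

Because the stored multiplier is only an approximation of 2^16/Quantizer, the result can differ from the exact divide by one. With a divisor of 40, for instance, the multiplier is 1638 and a coefficient of 20 yields 0 instead of 1. This is the kind of off-by-one difference that Section 4.0 describes as negligible.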

3.3. Loop Unrolling

Unrolling the loop of the MMX Technology implementation from processing 4 words 16 times to processing 32 words twice gives a 25% performance increase. Beyond a certain point, however, the improvement diminishes and no longer justifies the increase in code size.
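Loop unrolling itself is easy to show in scalar C. The sketch below is only an analogy for the MMX loop: it unrolls a multiply-high loop by a factor of 4, and the names scale_rolled, scale_unrolled4, and NCOEF are assumptions made for the example. The 25% figure above applies to the MMX implementation and should not be expected from this scalar version.

```c
#include <assert.h>
#include <stdint.h>

#define NCOEF 64  /* coefficients per 8X8 block (illustrative name) */

/* Rolled loop: one element per iteration, 64 loop-control tests. */
static void scale_rolled(const int16_t *in, const int16_t *mult, int16_t *out)
{
    for (int i = 0; i < NCOEF; i++)
        out[i] = (int16_t)(((int32_t)in[i] * mult[i]) >> 16);
}

/* Unrolled by 4: same arithmetic, a quarter of the loop-control overhead,
 * at the cost of a larger loop body. */
static void scale_unrolled4(const int16_t *in, const int16_t *mult, int16_t *out)
{
    assert(NCOEF % 4 == 0);
    for (int i = 0; i < NCOEF; i += 4) {
        out[i]     = (int16_t)(((int32_t)in[i]     * mult[i])     >> 16);
        out[i + 1] = (int16_t)(((int32_t)in[i + 1] * mult[i + 1]) >> 16);
        out[i + 2] = (int16_t)(((int32_t)in[i + 2] * mult[i + 2]) >> 16);
        out[i + 3] = (int16_t)(((int32_t)in[i + 3] * mult[i + 3]) >> 16);
    }
}
```

Both functions produce identical output. The unrolled form reduces counter updates and branches, and gives the scheduler more independent work per iteration.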

4.0. Performance Results

The table below compares the cycle counts of the C code implementation with the average cycle count of the MMX Technology implementation. For this picture the C code implementation clearly shows two peaks, which reflects the unpredictability of data-dependent branching. The MMX(TM) Technology implementation has no data-dependent branch and shows a single peak. When the result of quantization and rounding is zero, the early-out mechanism in the C code skips the divide, because divides are costly in processor cycles. The high performance algorithm removes both data-dependent branches, and it removes the divides by precalculating a table.

Two deviations from the original IJG code are listed below:

1) Modifying all quant table elements from "int" to "short" in file "jcparam.c" caused no errors. The 16/32 bit was a choice given to the user.

2) The error introduced due to the simple rounding is negligible. In every other block of 64 numbers, a number is off by one.

**Table 1: Performance Results**
| Metric | Pentium II Processor, C Routine | Pentium II Processor, Optimized Routine | Pentium Processor with MMX(TM) Technology, C Routine | Pentium Processor with MMX(TM) Technology, Optimized Routine |
|---|---|---|---|---|
| Cycles, range | 2529 - 9270 | 248 - 894 | 2809 - 13126 | 335 - 2610 |
| Cycles, average | 3360 | 311 | 3930 | 406 |
| Performance gain compared to C, minimum | 1X | 8X | 1X | 7X |
| Performance gain compared to C, average | 1X | 10.8X | 1X | 9.6X |
| Overall improvement in JPEG compression | 1X | 3X | 1X | 3X |

NOTES:

1) MMX technology in-line assembly code is compiled using Microsoft Visual C++ 5 with the compiler options set to produce Pentium code and optimization set to maximum speed.

2) Performance gain compared to C implementation = Cycle Count of C routine / Average Cycle Count of MMX routine.

3) System configuration:

Pentium II Processor: 266MHz, 32MB memory, 11ms HD seek time

Pentium Processor with MMX(TM) Technology: 233MHz, 64MB memory, 11ms HD seek time

The following graphs plot the cycles required to perform quantization of 8X8 blocks versus the percentage of the total number of blocks (16K). The four cases compare the Pentium II and Pentium Processor with MMX(TM) Technology with C code and Optimized implementations on the same photographic image. Due to the zero-early-out algorithm used in C code implementation, the peaks of the C code implementation will vary from one image to another.

5.0. Conclusion

This application note shows a successful use of MMX(TM) Technology instructions and Pentium II Processor optimized code to implement a JPEG Quantization algorithm. The optimized implementation demonstrated an approximately 10X performance gain over the original C implementation. The gain can be attributed to the substitution of multiplies for divides, the removal of data-dependent branches, and SIMD instructions that multiply four values in parallel in a single instruction.

6.0. Code Listing

By taking advantage of Intel's CPUID instruction, software developers can create software applications and tools that execute compatibly across the widest range of Intel processor generations and models, past, present, and future. If the CPUID instruction reports that the machine can run MMX(TM) Technology instructions, a global switch needs to be turned "on" to take advantage of the MMX(TM) Technology code. If CPUID reports a processor without MMX(TM) Technology, the scalar C code can be used as the default path, maintaining compatibility with all previous generations of Intel processors.

jcdctmgr.c is the modified file. The file contains both the C code and MMX(TM) Technology implementations. To choose the MMX(TM) Technology code, turn "on" the "MMXAvailable" switch. For a detailed description of using the CPUID instruction, see Intel Application Note AP-485. Visit http://developer.intel.com/design/perftool/cpuid/ for source and DLL.
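Setting the switch from CPUID might look like the sketch below. It is an illustration under several assumptions: it uses the <cpuid.h> helper __get_cpuid provided by GCC and Clang rather than the code from AP-485, it conservatively reports "no MMX" on other compilers or non-x86 targets, and the type of MMXAvailable (an int here) and the helper names detect_mmx and init_mmx_switch are hypothetical. The MMX feature flag is bit 23 of EDX for CPUID function 1.

```c
#include <assert.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Returns 1 if CPUID reports MMX (function 1, EDX bit 23), otherwise 0. */
static int detect_mmx(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                    /* CPUID function 1 is not supported */
    return (int)((edx >> 23) & 1u);
#else
    return 0;                        /* unknown platform: take the scalar C path */
#endif
}

/* Hypothetical stand-in for the global "MMXAvailable" switch; its real
 * declaration is not shown in this application note. */
static int MMXAvailable = 0;

/* Call once at startup, before any compression work begins. */
static void init_mmx_switch(void)
{
    MMXAvailable = detect_mmx();
    assert(MMXAvailable == 0 || MMXAvailable == 1);
}
```

Defaulting to 0 when detection is not possible keeps the scalar C code as the fallback path, which matches the compatibility goal described above.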

To run cjpeg.exe, you need to download the IJG software from one of the following sites. Free, portable C code for JPEG compression is available from the Independent JPEG Group. Source code, documentation, and test files are included. Version 6a is available from ftp.uu.net:/graphics/jpeg/jpegsrc.v6a.tar.gz. If you are on a PC you may prefer ZIP archive format, which you can find at ftp.simtel.net:/pub/simtelnet/msdos/graphics/jpegsr6a.zip (or at any Simtel mirror site). On CompuServe, see the Graphics Support forum (GO CIS:GRAPHSUP), library 12 "JPEG Tools", file jpegsr6a.zip. IJG requires all users of their code to read the readme.txt file.

Modifications required to run this code:

In file "jdct.h" modify: "typedef int DCTELEM;" to "typedef short int DCTELEM;"
In file "jcparam.c" modify all quant table elements from "int" to "short"

#include <stdio.h> 
/*
 * jcdctmgr.c
 *
 * Copyright (C) 1994-1996, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * This file contains the forward-DCT management logic.
 * This code selects a particular DCT implementation to be used,
 * and it performs related housekeeping chores including coefficient
 * quantization.
 */
#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
#include "jdct.h"		/* Private declarations for DCT subsystem */
/* Private subobject for this module */
typedef struct {
  struct jpeg_forward_dct pub;	/* public fields */
  /* Pointer to the DCT routine actually in use */
  forward_DCT_method_ptr do_dct;
  /* The actual post-DCT divisors --- not identical to the quant table
   * entries, because of scaling (especially for an unnormalized DCT).
   * Each table is given in normal array order.
   */
  DCTELEM * divisors[NUM_QUANT_TBLS];
#ifdef DCT_FLOAT_SUPPORTED
  /* Same as above for the floating-point case. */
  float_DCT_method_ptr do_float_dct;
  FAST_FLOAT * float_divisors[NUM_QUANT_TBLS];
#endif
} my_fdct_controller;
typedef my_fdct_controller * my_fdct_ptr;
/*
 * Initialize for a processing pass.
 * Verify that all referenced Q-tables are present, and set up
 * the divisor table for each one.
 * In the current implementation, DCT of all components is done during
 * the first pass, even if only some components will be output in the
 * first scan.  Hence all components should be examined here.
 */
/*NEW CODE ADDED
mmx_rounders is an array of two 64 entries corresponding to the two 
Quantization tables - Luminance and Chrominance.
*/
DCTELEM mmx_rounders[NUM_QUANT_TBLS][DCTSIZE2];	
METHODDEF(void)
start_pass_fdctmgr (j_compress_ptr cinfo)
{
  my_fdct_ptr fdct = (my_fdct_ptr) cinfo->fdct;
  int ci, qtblno, i;
  jpeg_component_info *compptr;
  JQUANT_TBL * qtbl;
  DCTELEM * dtbl;
  for (ci = 0, compptr = cinfo->comp_info; ci < cinfo->num_components;
       ci++, compptr++) {
    qtblno = compptr->quant_tbl_no;
    /* Make sure specified quantization table is present */
    if (qtblno < 0 || qtblno >= NUM_QUANT_TBLS ||
	cinfo->quant_tbl_ptrs[qtblno] == NULL)
      ERREXIT1(cinfo, JERR_NO_QUANT_TABLE, qtblno);
    qtbl = cinfo->quant_tbl_ptrs[qtblno];
    /* Compute divisors for this quant table */
    /* We may do this more than once for same table, but it's not a big deal */
    switch (cinfo->dct_method) {
#ifdef DCT_ISLOW_SUPPORTED
    case JDCT_ISLOW:
      /* For LL&M IDCT method, divisors are equal to raw quantization
       * coefficients multiplied by 8 (to counteract scaling).
       */
      if (fdct->divisors[qtblno] == NULL) {
	fdct->divisors[qtblno] = (DCTELEM *)
	  (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
				      DCTSIZE2 * SIZEOF(DCTELEM));
      }
      dtbl = fdct->divisors[qtblno];
      for (i = 0; i < DCTSIZE2; i++) {
	dtbl[i] = ((DCTELEM) qtbl->quantval[i]) << 3;
      }
      break;
#endif
#ifdef DCT_IFAST_SUPPORTED
    case JDCT_IFAST:
      {
	/* For AA&N IDCT method, divisors are equal to quantization
	 * coefficients scaled by scalefactor[row]*scalefactor[col], where
	 *   scalefactor[0] = 1
	 *   scalefactor[k] = cos(k*PI/16) * sqrt(2)    for k=1..7
	 * We apply a further scale factor of 8.
	 */
#define CONST_BITS 14
	static const INT16 aanscales[DCTSIZE2] = {
	  /* precomputed values scaled up by 14 bits */
	  16384, 22725, 21407, 19266, 16384, 12873,  8867,  4520,
	  22725, 31521, 29692, 26722, 22725, 17855, 12299,  6270,
	  21407, 29692, 27969, 25172, 21407, 16819, 11585,  5906,
	  19266, 26722, 25172, 22654, 19266, 15137, 10426,  5315,
	  16384, 22725, 21407, 19266, 16384, 12873,  8867,  4520,
	  12873, 17855, 16819, 15137, 12873, 10114,  6967,  3552,
	   8867, 12299, 11585, 10426,  8867,  6967,  4799,  2446,
	   4520,  6270,  5906,  5315,  4520,  3552,  2446,  1247
	};
	SHIFT_TEMPS
	if (fdct->divisors[qtblno] == NULL) {
	  fdct->divisors[qtblno] = (DCTELEM *)
	    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
					DCTSIZE2 * SIZEOF(DCTELEM));
	}
	dtbl = fdct->divisors[qtblno];
	for (i = 0; i < DCTSIZE2; i++) {
	  dtbl[i] = (DCTELEM)
	    DESCALE(MULTIPLY16V16((INT32) qtbl->quantval[i],
				  (INT32) aanscales[i]),
		    CONST_BITS-3);
	}
      }
      break;
#endif
#ifdef DCT_FLOAT_SUPPORTED
    case JDCT_FLOAT:
      {
	/* For float AA&N IDCT method, divisors are equal to quantization
	 * coefficients scaled by scalefactor[row]*scalefactor[col], where
	 *   scalefactor[0] = 1
	 *   scalefactor[k] = cos(k*PI/16) * sqrt(2)    for k=1..7
	 * We apply a further scale factor of 8.
	 * What's actually stored is 1/divisor so that the inner loop can
	 * use a multiplication rather than a division.
	 */
	FAST_FLOAT * fdtbl;
	int row, col;
	static const double aanscalefactor[DCTSIZE] = {
	  1.0, 1.387039845, 1.306562965, 1.175875602,
	  1.0, 0.785694958, 0.541196100, 0.275899379
	};
	if (fdct->float_divisors[qtblno] == NULL) {
	  fdct->float_divisors[qtblno] = (FAST_FLOAT *)
	    (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
					DCTSIZE2 * SIZEOF(FAST_FLOAT));
	}
	fdtbl = fdct->float_divisors[qtblno];
	i = 0;
	for (row = 0; row < DCTSIZE; row++) {
	  for (col = 0; col < DCTSIZE; col++) {
	    fdtbl[i] = (FAST_FLOAT)
	      (1.0 / (((double) qtbl->quantval[i] *
		       aanscalefactor[row] * aanscalefactor[col] * 8.0)));
	    i++;
	  }
	}
      }
      break;
#endif
    default:
      ERREXIT(cinfo, JERR_NOT_COMPILED);
      break;
    }	//end of case
/*NEW CODE ADDED
If an MMX machine is detected:
"mmx_rounders" is used to round the Quantized values to integers. It is 
an array of two 64 entries corresponding to the two Quantization tables - 
Luminance and Chrominance. The idea is to store the rounding values (half 
the Quantization table value) in this array and add them to the DCT 
frequency components later.
dtbl[] is overwritten with (2^16)/dtbl[]. This quantity can be multiplied 
to the sum of DCT frequency components and their rounding factors. The 
"divides" can thus be converted into "multiplies" speeding up the process
significantly.
*/
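  /* dtbl is set only by the integer DCT cases above, so skip the float case */
  if (MMXAvailable && cinfo->dct_method != JDCT_FLOAT)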
	{
	for (i=0; i<DCTSIZE2; i++)
		{
		mmx_rounders[qtblno][i]=dtbl[i]>>1;
		dtbl[i]=( 65536 + (dtbl[i]>>1))/dtbl[i];	//16bits
		}
	}
  }		//end of loop
}		//end of func
/*
 * Perform forward DCT on one or more blocks of a component.
 *
 * The input samples are taken from the sample_data[] array starting at
 * position start_row/start_col, and moving to the right for any additional
 * blocks. The quantized coefficients are returned in coef_blocks[].
 */
METHODDEF(void)
forward_DCT (j_compress_ptr cinfo, jpeg_component_info * compptr,
	     JSAMPARRAY sample_data, JBLOCKROW coef_blocks,
	     JDIMENSION start_row, JDIMENSION start_col,
	     JDIMENSION num_blocks)
/* This version is used for integer DCT implementations. */
{
  /* This routine is heavily used, so it's worth coding it tightly. */
  my_fdct_ptr fdct = (my_fdct_ptr) cinfo->fdct;
  forward_DCT_method_ptr do_dct = fdct->do_dct;
  DCTELEM * divisors = fdct->divisors[compptr->quant_tbl_no];
  DCTELEM workspace[DCTSIZE2];	/* work area for FDCT subroutine */
  JDIMENSION bi;
DCTELEM *workspaceptr = workspace;
JCOEFPTR output;
int i,j;
  sample_data += start_row;	/* fold in the vertical offset once */
  for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE)
   {
    /* Load data into workspace, applying unsigned->signed conversion */
    { register DCTELEM *workspaceptr;
      register JSAMPROW elemptr;
      register int elemr;
      workspaceptr = workspace;
if (!MMXAvailable)
{
	for (elemr = 0; elemr < DCTSIZE; elemr++) 
	{
		elemptr = sample_data[elemr] + start_col;
		//printf("%i \n", sizeof(*workspaceptr));
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
		*workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
	}
		}
 else //if (MMXAvailable)
{
	__int64 centersamp64=0x0080008000800080;
	__asm {
		mov	eax, workspaceptr
		mov ebx, sample_data
		mov edx, start_col
		pxor	mm7,mm7
		mov ecx, [ebx+0]
		add ecx, edx		//sample_data[0]+start_col
		movq	mm6,centersamp64
			
		movq	mm0,[ecx]
		movq	mm1,mm0
		punpcklbw mm0,mm7
		punpckhbw mm1,mm7
		psubw	mm0,mm6
		psubw	mm1,mm6
		movq	[eax],mm0
		movq	[eax+8],mm1
		mov ecx, [ebx+1*4]
		add ecx, edx		//sample_data[1]+start_col
		movq	mm2,[ecx]
		movq	mm3,mm2
		punpcklbw mm2,mm7
		punpckhbw mm3,mm7
		psubw	mm2,mm6
		psubw	mm3,mm6
		movq	[eax+16],mm2
		movq	[eax+24],mm3
		mov ecx, [ebx+2*4]
		add ecx, edx		//sample_data[2]+start_col
		movq	mm4,[ecx]
		movq	mm5,mm4
		punpcklbw mm4,mm7
		punpckhbw mm5,mm7
		psubw	mm4,mm6
		psubw	mm5,mm6
		movq	[eax+32],mm4
		movq	[eax+40],mm5
		mov ecx, [ebx+3*4]
		add ecx, edx		//sample_data[3]+start_col
		movq	mm0,[ecx]
		movq	mm1,mm0
		punpcklbw mm0,mm7
		punpckhbw mm1,mm7
		psubw	mm0,mm6
		psubw	mm1,mm6
		movq	[eax+48],mm0
		movq	[eax+56],mm1
		mov ecx, [ebx+4*4]
		add ecx, edx		//sample_data[4]+start_col
		movq	mm2,[ecx]
		movq	mm3,mm2
		punpcklbw mm2,mm7
		punpckhbw mm3,mm7
		psubw	mm2,mm6
		psubw	mm3,mm6
		movq	[eax+64],mm2
		movq	[eax+72],mm3
		mov ecx, [ebx+5*4]
		add ecx, edx		//sample_data[5]+start_col
		movq	mm4,[ecx]
		movq	mm5,mm4
		punpcklbw mm4,mm7
		punpckhbw mm5,mm7
		psubw	mm4,mm6
		psubw	mm5,mm6
		movq	[eax+80],mm4
		movq	[eax+88],mm5
		mov ecx, [ebx+6*4]
		add ecx, edx		//sample_data[6]+start_col
		movq	mm0,[ecx]
		movq	mm1,mm0
		punpcklbw mm0,mm7
		punpckhbw mm1,mm7
		psubw	mm0,mm6
		psubw	mm1,mm6
		movq	[eax+96],mm0
		movq	[eax+104],mm1
		mov ecx, [ebx+7*4]
		add ecx, edx		//sample_data[7]+start_col
		movq	mm2,[ecx]
		movq	mm3,mm2
		punpcklbw mm2,mm7
		punpckhbw mm3,mm7
		psubw	mm2,mm6
		psubw	mm3,mm6
		movq	[eax+112],mm2
		movq	[eax+120],mm3
//		emms		//done later, after quant
}
      }
    }
    /* Perform the DCT */
    (*do_dct) (workspace);
if (MMXAvailable)
	{
JCOEFPTR output_ptr = coef_blocks[bi];

__int64 pos_one =0x0001000100010001;
__int64 neg_one =0xffffffffffffffff;
//loading the address of mmx_rounders.
DCTELEM *round_tbl = mmx_rounders[compptr->quant_tbl_no];
output=output_ptr;
/* NEW CODE
Quantize/descale the coefficients, and store into coef_blocks[].
The original C code took the DCT frequency coefficients, divided them
by the values of the quant tables, and rounded them to integers. Since 
divides are inefficient, they were converted to multiplies using the 
following technique:
									
(DCT+QuantVal/2)/QuantVal = ( (DCT+QuantVal/2) * (2^16/QuantVal) ) >> 16
QuantVal/2 is precalculated and stored in mmx_rounders.
(2^16/QuantVal) is precalculated and stored in divisors. The old divisors
are now multipliers. Finally, the pmulhw performs the "multiply" and the
implied ">> 16" in one operation, which yields the performance gain.
The branch for negative DCT frequency coefficients is also eliminated.
*/
__asm {
	xor	ebx, ebx				//zero the count
	mov	esi, workspaceptr		//load data
	mov	ecx, divisors			//load quantization multipliers
	mov	edx, round_tbl			//load rounding table
	mov	edi, output_ptr			//load storage
	movq  mm4, neg_one			//FFFF FFFF FFFF FFFF
	movq  mm5, pos_one			//0001 0001 0001 0001
quant_loop:
	movq	mm0,[esi+ebx]		//load data
	movq	mm3, mm0			//save copy of data
	pxor	mm7, mm7			//clear mm7
	movq	mm1,[edx+ebx]		//load rounder
	movq    mm2, [ecx+ebx]		//load quantization multipliers
	pcmpgtw	mm7, mm3			//generate mask 
								//negative words = FFFF
								//positive words = 0000
	pxor	mm0, mm4			//1's complement all
	paddw	mm0, mm5			//2's complement to flip the sign
	paddw	mm3, mm1			//add rounding factor to pos#
	paddw	mm0, mm1			//add rounding factor to neg#
	pmulhw	mm3, mm2			//multiply pos# and shift right 16bits
	pmulhw	mm0, mm2			//multiply neg# and shift right 16bits
	pxor	mm0, mm4			//1's complement all
	paddw	mm0, mm5			//2's complement to flip the sign
	pand	mm0, mm7			//mask and save the neg
	pandn	mm7, mm3			//mask and save the pos
	por		mm0, mm7			//combine pos and neg in mm0
	movq	[edi+ebx],mm0		//store data
	add		ebx,8				//add 8 bytes (4 words)
	cmp		ebx,128				//done yet (64 words*2bytes=128)
	jne		quant_loop
	emms
	}
}
else
{	//not MMX
#ifdef FAST_DIVIDE
#define DIVIDE_BY(a,b)	a /= b
#else
#define DIVIDE_BY(a,b)	if (a >= b) a /= b; else a = 0
#endif
    {// register DCTELEM temp, qval;
	  DCTELEM temp, qval;
      register int i;
      register JCOEFPTR output_ptr = coef_blocks[bi];
output=output_ptr;
      for (i = 0; i < DCTSIZE2; i++) {
	qval = divisors[i];
	temp = workspace[i];
	if (temp < 0) {
	  temp = -temp;
	  temp += qval>>1;	// for rounding 
	  DIVIDE_BY(temp, qval);
	  temp = -temp;
	} 
	else {
	  temp += qval>>1;	// for rounding 
	  DIVIDE_BY(temp, qval);
	}
	output_ptr[i] = (JCOEF) temp;
      }
    }
}//end of not MMX
  }
}

Code Listing 1: C and MMX(TM) Technology Implementation of JPEG Quantization of DCT coefficients.

* Legal Information © 1998 Intel Corporation